Fetch models results from HF #43

Open
Samoed wants to merge 20 commits into main
Conversation

@Samoed (Contributor) commented Nov 8, 2024

I've implemented a base version of retrieving results from the model card. This PR is a work in progress and is open for discussion. Currently, I’ve parsed the results for intfloat/multilingual-e5-large as a demo. During integration with other results, I ran into some issues, particularly with loading results using TaskResult. External results don’t have evaluation_time and mteb_version, which TaskResult requires.

@KennethEnevoldsen @x-tabdeveloping
Ref: embeddings-benchmark/mteb#1405, embeddings-benchmark/mteb#1373
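For illustration, a minimal sketch of the gap (the task name, score, and field layout below are placeholders, not values taken from the model card or from mteb's actual TaskResult schema):

```python
# Hypothetical shape of a result parsed from a model card: the card provides
# per-task scores, but not the two fields TaskResult also expects.
external_result = {
    "task_name": "SomeRetrievalTask",              # placeholder task name
    "scores": {"test": [{"main_score": 0.5}]},     # simplified score layout
    # missing: "evaluation_time"
    # missing: "mteb_version"
}

# The workaround discussed below is to fill these with sentinel values:
external_result["evaluation_time"] = float("nan")  # or -1 / None, as debated below
external_result["mteb_version"] = "unknown"
```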

@x-tabdeveloping

This is awesome! You're doing god's work here :D Thanks

@x-tabdeveloping

I think adding something invalid for the evaluation_time attribute like -1 would make quite a bit of sense. Otherwise we can also make it possible in the mteb library to allow None, though I'd prefer not to do this. What do you think, @KennethEnevoldsen?

@KennethEnevoldsen (Contributor) left a comment

This looks really good! We should probably do this as a one-time thing only (don't support it going forward).

Comment on lines 4 to 5
"evaluation_time": 0,
"mteb_version": "0.0.0",

Hmm, maybe we should make it clear that this is N/A.


(potentially by wrapping the original object to allow for None)

@Samoed (Contributor, Author) commented Nov 8, 2024

You mean that we should fetch results only once for each model and not check them for updates?

@KennethEnevoldsen (Contributor)

> You mean that we should fetch results only once for each model and not check them for updates?

Either that or we should update the create_meta CLI to also include MTEB version and eval time.

I would rather allow people to create a MTEB folder in the repo with the results.

1000 evaluations are also a bit hard to reason about:

[screenshot of the model card metadata with the long list of per-task results omitted]

I would much rather have:

score on MTEB(eng): 0.86
score on MTEB(multilingual): 0.71
...
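A tiny sketch of that aggregation (the benchmark groupings and per-task scores below are made up for illustration):

```python
# Hypothetical per-task scores grouped by benchmark
task_scores = {
    "MTEB(eng)": [0.84, 0.88, 0.86],
    "MTEB(multilingual)": [0.69, 0.73],
}

for benchmark, scores in task_scores.items():
    print(f"score on {benchmark}: {sum(scores) / len(scores):.2f}")
# score on MTEB(eng): 0.86
# score on MTEB(multilingual): 0.71
```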

@Samoed (Contributor, Author) commented Nov 8, 2024

> I would much rather have:
> score on MTEB(eng): 0.86
> score on MTEB(multilingual): 0.71

Do you mean that we should remove the per-task results from mteb create_meta and keep only the average result for each benchmark? And all other results should live in the model's repo?

@isaac-chung (Contributor)

Hey @Samoed wow thanks for working on this!

> I think adding something invalid for the evaluation_time attribute like -1 would make quite a bit of sense. Otherwise we can also make it possible in the mteb library to allow None, though I'd prefer not to do this. What do you think

Re: evaluation_time: using -1 likely makes it easier for us to keep track of the results that need attention down the line. I'd stay away from using None; right now having the field check is good.

> we should update the create_meta CLI to also include MTEB version and eval time.

Let's do that for sure.

> I would rather allow people to create a MTEB folder in the repo with the results.

I think this could be a great main option going forward, instead of PRs to the results repo (i.e. both are accepted, but the MTEB folder is preferred). People's results should be generated with the MTEB version and eval time (see above). For eval time to be meaningful, we might also want to track hardware info, right? Every time we scan for the MTEB folders, we can validate the results objects as well.

I'm all for automating as much as we can :P

@isaac-chung (Contributor) commented Nov 8, 2024

Slightly related to the eval time topic: maybe we could start enforcing kg_co2_emissions to be filled in as well, as @x-tabdeveloping suggested? CodeCarbon may take both eval time and hardware into account, so this plot might be more meaningful when comparing runs on different machines.

This will add one new dependency to MTEB, but I think the benefit is worth it.
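A minimal sketch of what that could look like with codecarbon's EmissionsTracker (run_evaluation below is a placeholder, not mteb's actual API):

```python
from codecarbon import EmissionsTracker

def run_evaluation() -> dict:
    # placeholder for the actual MTEB evaluation call
    return {"main_score": 0.5}

tracker = EmissionsTracker()       # records hardware and duration while running
tracker.start()
scores = run_evaluation()
kg_co2_emissions = tracker.stop()  # estimated emissions in kg CO2eq

result = {**scores, "kg_co2_emissions": kg_co2_emissions}
print(result)
```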

@Samoed marked this pull request as ready for review on November 9, 2024 at 12:11
@KennethEnevoldsen (Contributor)

+1 for enabling code-carbon by default

Re: -1 for values: this would lead to issues when e.g. taking a mean. I would use math.nan instead (it is a float and an invalid value).
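A quick sketch of that point (made-up evaluation times, using only the standard library and numpy):

```python
import math
import numpy as np

eval_times = [12.3, 45.6, -1, 7.8]           # -1 as the "missing" sentinel
print(sum(eval_times) / len(eval_times))      # 16.175 -- silently skewed by the sentinel

eval_times_nan = [12.3, 45.6, math.nan, 7.8]
print(np.mean(eval_times_nan))                # nan -- the gap stays visible
print(np.nanmean(eval_times_nan))             # ~21.9 -- mean over the known values only
```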

It sounds like for now:

  • Keep create_meta: it is still useful for adding certain tasks to metadata but it will not auto-update the leaderboard
  • Use only embeddings-benchmark/results for now, though we plan to add the option to use a folder on HF in the future

@Samoed (Contributor, Author) commented Nov 9, 2024

For now, what should I change, apart from the -1 value?

@isaac-chung (Contributor)

@KennethEnevoldsen in this PR embeddings-benchmark/mteb#1398, -1 was used in favor of NaN. To be consistent, should we keep the -1 approach? Unless some metrics aren't between [0, 1].

@KennethEnevoldsen (Contributor) commented Nov 11, 2024

That PR is more about adding a warning to existing code (I also think it is a bad idea in that case)

Would def. change it to NA here.

@KennethEnevoldsen (Contributor)

@Samoed if we add the NA here (runtime, version), I believe this is ready to merge. It is one of the few missing requirements for the new release of the leaderboard 2.0.

@Samoed (Contributor, Author) commented Nov 13, 2024

@KennethEnevoldsen I've updated the results.

@KennethEnevoldsen (Contributor)

Hmm, I believe NaN is not valid JSON; shouldn't it be "null"? If it can be read, I guess it doesn't matter.

for the MTEB version I would probably change it to:
"mteb_version": "unknown"

@Samoed (Contributor, Author) commented Nov 13, 2024

Yes, it’s a bit invalid as JSON, but Python can read it correctly.
Changing it to null would require updates to the MTEB loader, as it currently only accepts float values.
Now there are some tasks with NaN values too.
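For reference, a small standard-library sketch (not tied to the MTEB loader) of why this works in Python even though NaN is not strict JSON:

```python
import json
import math

# json.loads accepts the non-standard NaN literal by default ...
raw = '{"evaluation_time": NaN, "mteb_version": "unknown"}'
data = json.loads(raw)
print(math.isnan(data["evaluation_time"]))  # True

# ... and json.dumps will write it back out unless asked to be strict
print(json.dumps(data))                     # {"evaluation_time": NaN, "mteb_version": "unknown"}
try:
    json.dumps(data, allow_nan=False)       # strict mode rejects NaN
except ValueError as err:
    print(err)                              # out-of-range float values are not JSON compliant
```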

@Samoed (Contributor, Author) commented Nov 13, 2024

It also seems the loader requires the version to be a parsable version string.

@KennethEnevoldsen (Contributor)

I have added a PR which should allow the fix

@KennethEnevoldsen (Contributor)

@Samoed feel free to just merge it in

@Samoed (Contributor, Author) commented Nov 14, 2024

@KennethEnevoldsen I don't have permission to merge in this repository. Also, should we keep evaluation_time as math.nan, or change it to None?

@KennethEnevoldsen (Contributor)

Either is fine; I just wanted to make sure that the JSON was generally valid.

@KennethEnevoldsen (Contributor)

@Samoed I think the update is in on main - updating to the latest version would allow us to get this merged in

@x-tabdeveloping

Oh god, the problem seems to be that if mteb_version is None, then TaskResult will try to load the results as if they were from before version 1.11.0, while the results are organized along the new format. @KennethEnevoldsen Is it necessary to assume that if mteb_version is None then the version is <1.11.0? Because if so, we are in a bit of a deadlock. The only way I see around this error is either to assume the modern result format when mteb_version is None, or to bump the mteb version to something relatively new in this PR.

@x-tabdeveloping

Oooor (and I saw you did this at one point, @Samoed), we can say something like version == "unknown" and then, when we load the results, assume that it's the new format.
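A hedged sketch of the kind of branching being discussed; load_old_format and load_new_format are hypothetical stand-ins, not actual mteb internals, and the version comparison uses the packaging library:

```python
from packaging.version import Version

def load_old_format(data: dict) -> dict:
    return {"format": "pre-1.11.0", **data}   # placeholder for the old result layout

def load_new_format(data: dict) -> dict:
    return {"format": "current", **data}      # placeholder for the new result layout

def parse_result(data: dict) -> dict:
    version = data.get("mteb_version")
    if version in (None, "unknown"):
        # the workaround suggested here: treat a missing or "unknown" version as the
        # new format instead of falling back to the pre-1.11.0 loader
        return load_new_format(data)
    if Version(version) < Version("1.11.0"):
        return load_old_format(data)
    return load_new_format(data)
```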

@Samoed (Contributor, Author) commented Nov 15, 2024

@x-tabdeveloping I think the issue with loading was fixed in embeddings-benchmark/mteb#1453.

@x-tabdeveloping

What's up with the test failing then?

@Samoed (Contributor, Author) commented Nov 15, 2024

It was not updated after the PR merge. I tried with the new mteb version, but found a small issue.
